[PROTOCOL RFC] Full Void Type Support #7073
| ## Void columns without the table feature | ||
|
|
||
| When the `voidType` feature is not supported, `void` columns can only be **omitted**. Because a `void` column is never written to a data file, writers must reject **writing data** to a table whose schema contains any of the following shapes, in which omitting the `void` column(s) would leave nowhere to record the nullability or length of an enclosing value: | ||
| - a `void` type directly inside an `array` or `map` at any nesting level; |
There was a problem hiding this comment.
For map, shall we specify that it is allowed only for the value? I recall that we don't allow VOID keys anyway, right?
Also, by 'directly,' we mean arrays like ARRAY<VOID>. With this feature, are we also going to unblock void indirectly inside an array or map, like ARRAY<STRUCT<INT, VOID>>?
There was a problem hiding this comment.
> For map, shall we specify that it is allowed only for the value? I recall that we don't allow VOID keys anyway, right?

That is a Spark limitation, and I don't think it belongs in the Delta protocol. FWIW, the protocol does not say anything about the nullability of map keys in general.

> Also, by 'directly,' we mean arrays like `ARRAY<VOID>`. But with this feature, are we also going to unblock `void` indirectly inside an `array` or `map`, right? Like `ARRAY<STRUCT<INT, VOID>>`?

Indirect voids are already unblocked. It is only an implementation detail that the Spark connector blocks them; otherwise there is no reason to block them. This feature does not affect those voids.
|
|
||
| When Void Type is supported (when the `writerFeatures` field of a table's `protocol` action contains `voidType`), writers: | ||
| - must store the table's structural `void` columns as `UNKNOWN` (see [Structural void columns](#structural-void-columns)). | ||
| - may store any non-structural `void` column either by omission or as an `UNKNOWN` column. |
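The two writer rules above reduce to a small decision. The following is a minimal sketch in Python, assuming the `protocol` action has already been parsed into a dict and that the caller has already classified the column as structural or not. The helper names and the `"OMIT"`/`"UNKNOWN"` labels are hypothetical, chosen for illustration only; the version numbers in the sample action follow the existing table-features convention.

```python
def void_type_enabled(protocol: dict) -> bool:
    """True when the protocol action lists voidType among its writer features."""
    return "voidType" in (protocol.get("writerFeatures") or [])


def choose_void_storage(protocol: dict, is_structural: bool) -> str:
    """Pick how a writer stores one void column (hypothetical helper).

    Returns "UNKNOWN" to materialize the column, or "OMIT" to leave it out
    of the data file.
    """
    if is_structural:
        if not void_type_enabled(protocol):
            # Without the feature a structural void cannot be represented,
            # so the write has to be rejected.
            raise ValueError("schema requires the voidType table feature")
        return "UNKNOWN"
    # Non-structural voids may use either form; this sketch assumes omission.
    return "OMIT"


# Sample protocol actions (illustrative values).
legacy = {"minReaderVersion": 1, "minWriterVersion": 2}
with_feature = {
    "minReaderVersion": 3,
    "minWriterVersion": 7,
    "readerFeatures": ["voidType"],
    "writerFeatures": ["voidType"],
}
```

Choosing omission for non-structural columns keeps the number of `UNKNOWN` columns low; a writer that materializes them instead would still satisfy the rule as written.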
There was a problem hiding this comment.
Why don't we force writers to always write non-structural void columns by omission?
There was a problem hiding this comment.
Because that would make the protocol more difficult to implement and is not strictly necessary. Having fewer UNKNOWNs only makes dropping the feature easier.
But I'll update it to "should" to show it's the preferred behavior.
|
|
||
| A `void` column in any other position is never structural: it can be omitted, and does not require the feature. A schema that contains one of the shapes above is said to **require** the `voidType` feature. | ||
|
|
||
| ### Writer Requirements for Void Type |
There was a problem hiding this comment.
Shall we also explain how writers should handle statistics for VOID columns?
There was a problem hiding this comment.
I don't think we need any special handling for stats, so the default rules from the protocol should be sufficient.
| # Void Type | ||
| **Associated GitHub issue for discussions: https://github.com/delta-io/delta/issues/7072** | ||
|
|
||
| This protocol change adds support for using the `void` data type (also known as `NullType` in Spark, `UnknownType` in Iceberg, and `UNKNOWN` in Parquet) anywhere in a Delta table schema, via a new reader/writer table feature, `voidType`. |
There was a problem hiding this comment.
| This protocol change adds support for using the `void` data type (also known as `NullType` in Spark, `UnknownType` in Iceberg, and `UNKNOWN` in Parquet) anywhere in a Delta table schema, via a new reader/writer table feature, `voidType`. | |
| The `voidType` reader/writer table feature adds support for using the `void` data type (also known as `NullType` in Spark, `UnknownType` in Iceberg, and `UNKNOWN` in Parquet) anywhere in a Delta table schema. |
|
|
||
| `void` is a data type with a single possible value: `NULL`. A column ends up with this type when the writer has no information about its actual type, typically because every value observed so far has been `NULL` (for example, `CREATE TABLE t AS SELECT NULL AS a`, or schema evolution that adds a column containing only `NULL`s). | ||
|
|
||
| Today, `void` columns are represented by omitting them from data files and reconstructing them as all-`NULL` columns on read (the missing columns mechanism). That representation cannot encode four schema shapes - a table whose columns are all `void`, a `struct` whose fields are all `void`, a `void` nested in an `array`, and a `void` nested in a `map` - because in each case omitting the `void` column(s) would leave the enclosing `struct`, `array`, or `map` (or the table itself) with nothing written to a data file, and therefore nowhere to record whether the enclosing value is `NULL`, empty, or how long it is. Writers must reject writing data in those cases. |
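The information loss described above can be shown with a toy round trip. This is a minimal sketch that models rows as Python dicts rather than real data files; `write_omitting` and `read_reconstructing` are hypothetical stand-ins for a writer that omits `void` columns and a reader that applies the missing-columns mechanism.

```python
def write_omitting(rows, void_columns):
    """Simulate a data file: drop the listed columns before storing each row."""
    return [{k: v for k, v in row.items() if k not in void_columns} for row in rows]


def read_reconstructing(stored, void_columns):
    """Simulate the missing-columns mechanism: add each column back as NULL."""
    return [{**row, **{c: None for c in void_columns}} for row in stored]


# A top-level void column round-trips, because its only possible value is NULL.
flat = [{"id": 1, "v": None}, {"id": 2, "v": None}]
flat_back = read_reconstructing(write_omitting(flat, {"v"}), {"v"})

# A column of type array<void> has nothing left to write once its void
# element is omitted, yet it can hold three distinguishable kinds of value.
nested = [
    {"id": 1, "xs": [None, None]},  # two NULL elements
    {"id": 2, "xs": []},            # empty array
    {"id": 3, "xs": None},          # NULL array
]
nested_back = read_reconstructing(write_omitting(nested, {"xs"}), {"xs"})
```

Here `flat_back` equals `flat`, while every `xs` in `nested_back` comes back as `None`: the length and emptiness of the arrays are gone.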
There was a problem hiding this comment.
That's true only for some engines like Spark. I think it's officially undefined since it is not supported.
There was a problem hiding this comment.
It's becoming official in #6966. I built this RFC assuming that protocol clarification makes it in.
Before that change, the protocol basically said that tables can have void and that the behavior is undefined, but that dropping it on read is recommended.
| - a `struct` (at any nesting level) whose fields are all `void`; or | ||
| - a table whose columns are all `void`. | ||
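The shapes listed above can be detected with a recursive walk over the table schema. The following is a minimal sketch, assuming the schema uses the Delta JSON layout (`struct` with `fields`, `array` with `elementType`, `map` with `keyType` and `valueType`) and that `void` is serialized as the string `"void"`. The function names are hypothetical, map keys are checked as well as values, and a `struct` with no fields is treated as not matching; both choices are assumptions of this sketch.

```python
def is_void(data_type) -> bool:
    return data_type == "void"


def requires_void_type(data_type) -> bool:
    """True if the schema contains a shape that cannot be stored by omission."""
    if isinstance(data_type, str):
        return False  # primitive type names are never containers
    kind = data_type["type"]
    if kind == "struct":
        field_types = [f["type"] for f in data_type["fields"]]
        # Covers both an all-void struct and, at the root, an all-void table.
        if field_types and all(is_void(t) for t in field_types):
            return True
        return any(requires_void_type(t) for t in field_types)
    if kind == "array":
        element = data_type["elementType"]
        return is_void(element) or requires_void_type(element)
    if kind == "map":
        parts = (data_type["keyType"], data_type["valueType"])
        return any(is_void(p) or requires_void_type(p) for p in parts)
    return False


def field(name, data_type):
    """Build one schema field (helper for the examples below)."""
    return {"name": name, "type": data_type, "nullable": True, "metadata": {}}


def struct(*fields):
    return {"type": "struct", "fields": list(fields)}


def array(element):
    return {"type": "array", "elementType": element, "containsNull": True}


all_void_table = struct(field("a", "void"))
mixed_table = struct(field("a", "integer"), field("b", "void"))
direct_in_array = struct(field("id", "integer"), field("xs", array("void")))
indirect_in_array = struct(
    field("xs", array(struct(field("i", "integer"), field("v", "void"))))
)
```

Under these assumptions, `all_void_table` and `direct_in_array` match, while `mixed_table` and `indirect_in_array` (the `ARRAY<STRUCT<INT, VOID>>` case) do not.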
|
|
||
| These restrictions are stated in terms of the **table schema**, not the schema of any individual data file. A table with such a schema can still be created, altered through metadata-only operations, and read. It can be made writable by evolving its schema - for example, by changing a `void` column to another type - or by enabling the `voidType` feature. |
There was a problem hiding this comment.
If the column is omitted, then the schemas of the individual data files are all the same, right?
There was a problem hiding this comment.
There can be differences due to schema evolution/type widening. Voids will always be missing, but other columns can differ.
|
|
||
| A `void` column may be changed to any other data type through supported schema-evolution operations; this does not require the [Type Widening](/PROTOCOL.md#type-widening) table feature, even when the `void` column is stored as `UNKNOWN`. | ||
|
|
||
| ## Void columns without the table feature |
There was a problem hiding this comment.
I'm not sure we can enforce anything on the case without a table feature. A legacy writer/reader does not have knowledge of this new proposal and cannot retroactively ban certain operations.
It seems to me, if someone wants the void type to behave as expected, they must have the table feature and from there it's up to the engine whether they want to omit or materialize the void type column.
There was a problem hiding this comment.
The situation is not ideal. void never officially made it into the protocol; it was accidentally introduced to tables by the Spark connector when Spark got NullType support. Various revisions to the protocol then made sure void was mentioned, but the wording stayed vague because even the Spark connector did not handle it properly, which caused query failures. After #6966, the behavior will be defined, and both the Spark connector and kernel-rs follow that version of the protocol.
I understand external clients may now become non-compliant with the protocol, but if they previously managed to read what Spark wrote (Spark being, I think, the reference implementation), and if what they wrote could be read by Spark, then they should still be protocol-compliant. In any case, this comment is more for #6966 than for this PR.
| ### Reader Requirements for Void Type | ||
|
|
||
| When Void Type is supported (when the `readerFeatures` field of a table's `protocol` action contains `voidType`), readers: | ||
| - must recognize and tolerate a `void` data type anywhere in a Delta table schema. |
There was a problem hiding this comment.
This phrasing is a bit weird. "must allow"
|
|
||
| When Void Type is supported (when the `readerFeatures` field of a table's `protocol` action contains `voidType`), readers: | ||
| - must recognize and tolerate a `void` data type anywhere in a Delta table schema. | ||
| - must read a `void` column stored as `UNKNOWN` as an all-`NULL` column. |
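From the reader's side, the outcome is the same whether a `void` column was omitted from the data file or materialized in it. The following is a minimal sketch, assuming rows are modeled as Python dicts and only top-level columns are considered; `read_rows` is a hypothetical helper, not an API of any Delta connector.

```python
def read_rows(schema_fields, file_rows):
    """Return rows shaped like the table schema, with void columns all NULL."""
    result = []
    for stored in file_rows:
        row = {}
        for field in schema_fields:
            if field["type"] == "void":
                # The table schema decides: a void column is always NULL,
                # regardless of what the data file does or does not contain.
                row[field["name"]] = None
            else:
                row[field["name"]] = stored.get(field["name"])
        result.append(row)
    return result


schema_fields = [
    {"name": "id", "type": "integer"},
    {"name": "v", "type": "void"},
]
file_with_omitted = [{"id": 1}, {"id": 2}]
file_with_materialized = [{"id": 1, "v": None}, {"id": 2, "v": None}]

rows_a = read_rows(schema_fields, file_with_omitted)
rows_b = read_rows(schema_fields, file_with_materialized)
```

Both calls produce the same rows, which is why the requirement can be phrased in terms of the table schema rather than the data file's schema.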
There was a problem hiding this comment.
specify Parquet here, since this seems targeted.
Although maybe a more neutral way to frame this is "must return only null values for columns defined as void in the table schema". Whether it's omitted or materialized, the actual behaviour is that. It's less about the underlying data files' schema than it is about the actual Delta table's schema.
There was a problem hiding this comment.
I reworded it to not mention any type and to just say that readers return all nulls, independent of the representation.
Which Delta project/connector is this regarding?
Description
Associated GitHub issue for discussions: #7072
This PR adds the proposed protocol change for full VOID support everywhere in a Delta table schema. The current protocol is being clarified in #6966 to specify how VOID must be handled today; this RFC goes further and defines a new table feature that allows tables to persist VOID columns as the UNKNOWN type in Parquet, and hence lifts the schema limitations we have today.
How was this patch tested?
N/A
Does this PR introduce any user-facing changes?
Creates a new Protocol RFC.